{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Playing Atari games using Categorical DQN\n",
    "\n",
    "Let's implement the categorical DQN algorithm for playing the Atari games. The code used\n",
    "in this section is adapted from open-source categorical DQN implementation - \n",
    "https://github.com/princewen/tensorflow_practice/tree/master/RL/Basic-DisRLDemo provided by Prince Wen. \n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2.0.0\n"
     ]
    }
   ],
   "source": [
    "import tensorflow as tf\n",
    "print(tf.__version__)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's import the necessary libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "import numpy as np\n",
    "import random\n",
    "from collections import deque\n",
    "import math\n",
    "\n",
    "import tensorflow.compat.v1 as tf\n",
    "tf.disable_v2_behavior()\n",
    "\n",
    "import gym\n",
    "from tensorflow.python.framework import ops"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Defining the convolutional layer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def conv(inputs, kernel_shape, bias_shape, strides, weights, bias=None, activation=tf.nn.relu):\n",
    "\n",
    "    weights = tf.get_variable('weights', shape=kernel_shape, initializer=weights)\n",
    "    conv = tf.nn.conv2d(inputs, weights, strides=strides, padding='SAME')\n",
    "    if bias_shape is not None:\n",
    "        biases = tf.get_variable('biases', shape=bias_shape, initializer=bias)\n",
    "        return activation(conv + biases) if activation is not None else conv + biases\n",
    "    \n",
    "    return activation(conv) if activation is not None else conv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Defining the dense layer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def dense(inputs, units, bias_shape, weights, bias=None, activation=tf.nn.relu):\n",
    "    \n",
    "    if not isinstance(inputs, ops.Tensor):\n",
    "        inputs = ops.convert_to_tensor(inputs, dtype='float')\n",
    "    if len(inputs.shape) > 2:\n",
    "        inputs = tf.layers.flatten(inputs)\n",
    "    flatten_shape = inputs.shape[1]\n",
    "    weights = tf.get_variable('weights', shape=[flatten_shape, units], initializer=weights)\n",
    "    dense = tf.matmul(inputs, weights)\n",
    "    if bias_shape is not None:\n",
    "        assert bias_shape[0] == units\n",
    "        biases = tf.get_variable('biases', shape=bias_shape, initializer=bias)\n",
    "        return activation(dense + biases) if activation is not None else dense + biases\n",
    "    \n",
    "    return activation(dense) if activation is not None else dense\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Defining the variables\n",
    "\n",
    "Now, let's define some of the important variables."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Initialize the $V_{min}$ and $V_{max}$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "v_min = 0\n",
    "v_max = 1000"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Initialize the number of atoms (supports) $N$:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "atoms = 51"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the discount factor, $\\gamma$:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "gamma = 0.99 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the batch size:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "batch_size = 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the time step at which we want to update the target network"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "update_target_net = 50 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the epsilon value which is used in the epsilon-greedy policy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "epsilon = 0.5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Defining the replay buffer\n",
    "\n",
    "First, let's define the buffer length:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "buffer_length = 20000"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the replay buffer as a deque structure:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "replay_buffer = deque(maxlen=buffer_length)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We define a function called sample_transitions which returns the randomly sampled\n",
    "minibatch of transitions from the replay buffer:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "def sample_transitions(batch_size):\n",
    "    batch = np.random.permutation(len(replay_buffer))[:batch_size]\n",
    "    trans = np.array(replay_buffer)[batch]\n",
    "    return trans"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Defining the Categorical DQN class\n",
    "\n",
    "Let's define the class called `Categorical_DQN` where we will implement the categorical\n",
    "DQN algorithm. For a clear understanding, you can also check the detailed explanation of code on the book."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Categorical_DQN():\n",
    "    \n",
    "    #first, let's define the init method\n",
    "    def __init__(self,env):\n",
    "        \n",
    "        #start the TensorFlow session\n",
    "        self.sess = tf.InteractiveSession()\n",
    "        \n",
    "        #initialize v_min and v_max\n",
    "        self.v_max = v_max\n",
    "        self.v_min = v_min\n",
    "        \n",
    "        #initialize the number of atoms\n",
    "        self.atoms = atoms \n",
    "        \n",
    "        #initialize the epsilon value\n",
    "        self.epsilon = epsilon\n",
    "        \n",
    "        #get the state shape of the environment\n",
    "        self.state_shape = env.observation_space.shape\n",
    "        \n",
    "        #get the action shape of the environment\n",
    "        self.action_shape = env.action_space.n\n",
    "\n",
    "        #initialize the time step:\n",
    "        self.time_step = 0\n",
    "        \n",
    "        #initialize the target state shape\n",
    "        target_state_shape = [1]\n",
    "        target_state_shape.extend(self.state_shape)\n",
    "\n",
    "        #define the placeholder for the state\n",
    "        self.state_ph = tf.placeholder(tf.float32,target_state_shape)\n",
    "        \n",
    "        #define the placeholder for the action\n",
    "        self.action_ph = tf.placeholder(tf.int32,[1,1])\n",
    "                                       \n",
    "        #define the placeholder for the m value (distributed probability of target distribution)\n",
    "        self.m_ph = tf.placeholder(tf.float32,[self.atoms])\n",
    "    \n",
    "        #compute delta z\n",
    "        self.delta_z = (self.v_max - self.v_min) / (self.atoms - 1)\n",
    "                                       \n",
    "        #compute the support values\n",
    "        self.z = [self.v_min + i * self.delta_z for i in range(self.atoms)]\n",
    "\n",
    "        self.build_categorical_DQN()\n",
    "                                       \n",
    "        #initialize all the TensorFlow variables\n",
    "        self.sess.run(tf.global_variables_initializer())\n",
    "\n",
    "        \n",
    "    #let's define a function called build_network for building a deep network. Since we are\n",
    "    #dealing with the Atari games, we use the convolutional neural network\n",
    "                                       \n",
    "    def build_network(self, state, action, name, units_1, units_2, weights, bias, reg=None):\n",
    "                                       \n",
    "        #define the first convolutional layer\n",
    "        with tf.variable_scope('conv1'):\n",
    "            conv1 = conv(state, [5, 5, 3, 6], [6], [1, 2, 2, 1], weights, bias)\n",
    "                                       \n",
    "        #define the second convolutional layer\n",
    "        with tf.variable_scope('conv2'):\n",
    "            conv2 = conv(conv1, [3, 3, 6, 12], [12], [1, 2, 2, 1], weights, bias)\n",
    "                                       \n",
    "        #flatten the feature maps obtained as a result of the second convolutional layer\n",
    "        with tf.variable_scope('flatten'):\n",
    "            flatten = tf.layers.flatten(conv2)\n",
    "    \n",
    "        #define the first dense layer\n",
    "        with tf.variable_scope('dense1'):\n",
    "            dense1 = dense(flatten, units_1, [units_1], weights, bias)\n",
    "                                       \n",
    "        #define the second dense layer\n",
    "        with tf.variable_scope('dense2'):\n",
    "            dense2 = dense(dense1, units_2, [units_2], weights, bias)\n",
    "                                       \n",
    "        #concatenate the second dense layer with the action\n",
    "        with tf.variable_scope('concat'):\n",
    "            concatenated = tf.concat([dense2, tf.cast(action, tf.float32)], 1)\n",
    "                                       \n",
    "        #define the third layer and apply the softmax function to the result of the third layer and\n",
    "        #obtain the probabilities for each of the atoms\n",
    "        with tf.variable_scope('dense3'):\n",
    "            dense3 = dense(concatenated, self.atoms, [self.atoms], weights, bias) \n",
    "        return tf.nn.softmax(dense3)\n",
    "\n",
    "    #now, let's define a function called build_categorical_DQNfor building the main and\n",
    "    #target categorical deep Q networks\n",
    "                                       \n",
    "    def build_categorical_DQN(self):      \n",
    "                                       \n",
    "        #define the main categorical DQN and obtain the probabilities\n",
    "        with tf.variable_scope('main_net'):\n",
    "            name = ['main_net_params',tf.GraphKeys.GLOBAL_VARIABLES]\n",
    "            weights = tf.random_uniform_initializer(-0.1,0.1)\n",
    "            bias = tf.constant_initializer(0.1)\n",
    "\n",
    "            self.main_p = self.build_network(self.state_ph,self.action_ph,name,24,24,weights,bias)\n",
    "                                       \n",
    "        #define the target categorical DQN and obtain the probabilities\n",
    "        with tf.variable_scope('target_net'):\n",
    "            name = ['target_net_params',tf.GraphKeys.GLOBAL_VARIABLES]\n",
    "\n",
    "            weights = tf.random_uniform_initializer(-0.1,0.1)\n",
    "            bias = tf.constant_initializer(0.1)\n",
    "\n",
    "            self.target_p = self.build_network(self.state_ph,self.action_ph,name,24,24,weights,bias)\n",
    "\n",
    "        #compute the main Q value with probabilities obtained from the main categorical DQN\n",
    "        self.main_Q = tf.reduce_sum(self.main_p * self.z)\n",
    "                                    \n",
    "        #similarly, compute the target Q value with probabilities obtained from the target categorical DQN \n",
    "        self.target_Q = tf.reduce_sum(self.target_p * self.z)\n",
    "        \n",
    "        #define the cross entropy loss\n",
    "        self.cross_entropy_loss = -tf.reduce_sum(self.m_ph * tf.log(self.main_p))\n",
    "        \n",
    "        #define the optimizer and minimize the cross entropy loss using Adam optimizer\n",
    "        self.optimizer = tf.train.AdamOptimizer(0.01).minimize(self.cross_entropy_loss)\n",
    "    \n",
    "        #get the main network parameters\n",
    "        main_net_params = tf.get_collection(\"main_net_params\")\n",
    "        \n",
    "        #get the target network parameters\n",
    "        target_net_params = tf.get_collection('target_net_params')\n",
    "        \n",
    "        #define the update_target_net operation for updating the target network parameters by\n",
    "        #copying the parameters of the main network\n",
    "        self.update_target_net = [tf.assign(t, e) for t, e in zip(target_net_params, main_net_params)]\n",
    "\n",
    "    #let's define a function called train to train the network\n",
    "    def train(self,s,r,action,s_,gamma):\n",
    "        \n",
    "        #increment the time step\n",
    "        self.time_step += 1\n",
    "    \n",
    "        #get the target Q values\n",
    "        list_q_ = [self.sess.run(self.target_Q,feed_dict={self.state_ph:[s_],self.action_ph:[[a]]}) for a in range(self.action_shape)]\n",
    "        \n",
    "        #select the next state action a dash as the one which has the maximum Q value\n",
    "        a_ = tf.argmax(list_q_).eval()\n",
    "        \n",
    "        #initialize an array m with shape as the number of support with zero values. The denotes\n",
    "        #the distributed probability of the target distribution after the projection step\n",
    "\n",
    "        m = np.zeros(self.atoms)\n",
    "        \n",
    "        #get the probability for each atom using the target categorical DQN\n",
    "        p = self.sess.run(self.target_p,feed_dict = {self.state_ph:[s_],self.action_ph:[[a_]]})[0]\n",
    "        \n",
    "        #perform the projection step\n",
    "        for j in range(self.atoms):\n",
    "            Tz = min(self.v_max,max(self.v_min,r+gamma * self.z[j]))\n",
    "            bj = (Tz - self.v_min) / self.delta_z \n",
    "            l,u = math.floor(bj),math.ceil(bj) \n",
    "\n",
    "            pj = p[j]\n",
    "\n",
    "            m[int(l)] += pj * (u - bj)\n",
    "            m[int(u)] += pj * (bj - l)\n",
    "    \n",
    "        #train the network by minimizing the loss\n",
    "        self.sess.run(self.optimizer,feed_dict={self.state_ph:[s] , self.action_ph:[action], self.m_ph: m })\n",
    "        \n",
    "        #update the target network parameters by copying the main network parameters\n",
    "        if self.time_step % update_target_net == 0:\n",
    "            self.sess.run(self.update_target_net)\n",
    "    \n",
    "    #let's define a function called select_action for selecting the action. We generate a random number and if the number is less than epsilon we select the random\n",
    "    #action else we select the action which has maximum Q value.\n",
    "    def select_action(self,s):\n",
    "        if random.random() <= self.epsilon:\n",
    "            return random.randint(0, self.action_shape - 1)\n",
    "        else: \n",
    "            return np.argmax([self.sess.run(self.main_Q,feed_dict={self.state_ph:[s],self.action_ph:[[a]]}) for a in range(self.action_shape)])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training the network\n",
    "\n",
    "Now, let's start training the network. First, create the Atari game environment using the\n",
    "gym. Let's create a Tennis game environment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "env = gym.make(\"Tennis-v0\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create an object to our `Categorical_DQN` class:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "agent = Categorical_DQN(env)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the number of episodes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_episodes = 800"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#for each episode\n",
    "for i in range(num_episodes):\n",
    "    \n",
    "    #set done to False\n",
    "    done = False\n",
    "    \n",
    "    #initialize the state by resetting the environment\n",
    "    state = env.reset()\n",
    "    \n",
    "    #initialize the return\n",
    "    Return = 0\n",
    "    \n",
    "    #while the episode is not over\n",
    "    while not done:\n",
    "        \n",
    "        #render the environment\n",
    "        env.render()\n",
    "        \n",
    "        #select an action\n",
    "        action = agent.select_action(state)\n",
    "        \n",
    "        #perform the selected action\n",
    "        next_state, reward, done, info = env.step(action)\n",
    "        \n",
    "        #update the return\n",
    "        Return = Return + reward\n",
    "        \n",
    "        #store the transition information into the replay buffer\n",
    "        replay_buffer.append([state, reward, [action], next_state])\n",
    "        \n",
    "        #if the length of the replay buffer is greater than or equal to buffer size then start training the\n",
    "        #network by sampling transitions from the replay buffer\n",
    "        if len(replay_buffer) >= batch_size:\n",
    "            trans = sample_transitions(2)\n",
    "            for item in trans:\n",
    "                agent.train(item[0],item[1], item[2], item[3],gamma)\n",
    "                \n",
    "        #update the state to the next state\n",
    "        state = next_state\n",
    "    \n",
    "    #print the return obtained in the episode\n",
    "    print(\"Episode:{}, Return: {}\".format(i,Return))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we learned how categorical DQN works and how to implement them, in the next\n",
    "section, we will learn another interesting algorithm."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
